SGI Freeware 2002 November




Text File | 2000-10-09 | 50KB | 1,185 lines

This is gsl-ref.info, produced by Makeinfo version 3.12h from gsl-ref.texi.

INFO-DIR-SECTION Scientific software
START-INFO-DIR-ENTRY
* gsl-ref: (gsl-ref).                   GNU Scientific Library - Reference
END-INFO-DIR-ENTRY

   This file documents the GNU Scientific Library.

   Copyright (C) 1996, 1997, 1998, 1999 The GSL Project.

   Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

   Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

   Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Foundation.

File: gsl-ref.info,  Node: Sorting objects,  Next: Sorting vectors,  Up: Sorting

Sorting objects
===============

   The following function provides a simple alternative to the standard library function `qsort'. It is intended for systems lacking `qsort', not as a replacement for it. The function `qsort' should be used whenever possible, as it will be faster and can provide stable ordering of equal elements. Documentation for `qsort' is available in the `GNU C Library Reference Manual'.

   The functions described in this section are defined in the header file `gsl_heapsort.h'.

 - Function: void gsl_heapsort (void * ARRAY, size_t COUNT, size_t SIZE, gsl_comparison_fn_t COMPARE)
     This function sorts the COUNT members of the array ARRAY, each of size SIZE, into ascending order using the comparison function COMPARE.
     The type of the comparison function is defined by,

          int (*gsl_comparison_fn_t) (const void * a, const void * b)

     A comparison function should return a negative integer if the first argument is less than the second argument, `0' if the two arguments are equal and a positive integer if the first argument is greater than the second argument.

     For example, the following function can be used to sort doubles into ascending numerical order. It takes `const void *' arguments so that it matches `gsl_comparison_fn_t', and it compares the values rather than casting their difference to `int', which would report values differing by less than one as equal.

          int
          compare_doubles (const void * a, const void * b)
          {
            const double x = *(const double *) a;
            const double y = *(const double *) b;

            return (x > y) - (x < y);
          }

     The appropriate function call to perform the sort is,

          gsl_heapsort (array, count, sizeof(double), compare_doubles);

     Note that unlike `qsort' the heapsort algorithm cannot be made into a stable sort by pointer arithmetic. The trick of comparing pointers for equal elements in the comparison function does not work for the heapsort algorithm. The heapsort algorithm performs an internal rearrangement of the data which destroys its initial ordering.

 - Function: int gsl_heapsort_index (size_t * P, const void * ARRAY, size_t COUNT, size_t SIZE, gsl_comparison_fn_t COMPARE)
     This function indirectly sorts the COUNT members of the array ARRAY, each of size SIZE, into ascending order using the comparison function COMPARE. The resulting permutation is stored in P, an array of length COUNT. The elements of P give the index of the array element which would have been stored in that position if the array had been sorted in place. The first element of P gives the index of the least element in ARRAY, and the last element of P gives the index of the greatest element in ARRAY. The array itself is not changed.

File: gsl-ref.info,  Node: Sorting vectors,  Next: Computing the rank,  Prev: Sorting objects,  Up: Sorting

Sorting vectors
===============

   The following functions will sort the elements of an array or vector, either directly or indirectly. They are defined for all real and integer types using the normal suffix rules.
For example, the `float' versions of the array functions are `gsl_sort_float' and `gsl_sort_float_index'. The corresponding vector functions are `gsl_sort_vector_float' and `gsl_sort_vector_float_index'. The prototypes are available in the header files `gsl_sort_float.h' and `gsl_sort_vector_float.h'. The complete set of prototypes can be included using the header files `gsl_sort.h' and `gsl_sort_vector.h'.

   There are no functions for sorting complex arrays or vectors, since the ordering of complex numbers is not uniquely defined. To sort a complex vector by magnitude, compute a real vector containing the magnitudes of the complex elements, and sort this vector indirectly. The resulting index gives the appropriate ordering of the original complex vector.

 - Function: void gsl_sort_vector (gsl_vector * V)
     This function sorts the elements of the vector V into ascending numerical order.

 - Function: void gsl_sort (double * DATA, size_t STRIDE, size_t N)
     This function sorts the N elements of the array DATA with stride STRIDE into ascending numerical order.

 - Function: int gsl_sort_vector_index (gsl_permutation * P, const gsl_vector * V)
     This function indirectly sorts the elements of the vector V into ascending order, storing the resulting permutation in P. The elements of P give the index of the vector element which would have been stored in that position if the vector had been sorted in place. The first element of P gives the index of the least element in V, and the last element of P gives the index of the greatest element in V. The vector V is not changed.

 - Function: int gsl_sort_index (size_t * P, const double * DATA, size_t STRIDE, size_t N)
     This function indirectly sorts the N elements of the array DATA with stride STRIDE into ascending order, storing the resulting permutation in P. The array P must be allocated to a sufficient length to store the N elements of the permutation.
     The elements of P give the index of the array element which would have been stored in that position if the array had been sorted in place. The array DATA is not changed.

File: gsl-ref.info,  Node: Computing the rank,  Next: Sorting Examples,  Prev: Sorting vectors,  Up: Sorting

Computing the rank
==================

   The "rank" of an element is its order in the sorted data. The rank is the inverse of the index permutation, P. It can be computed using the following algorithm,

     for (i = 0; i < p->size; i++)
       {
         size_t pi = p->data[i];
         rank->data[pi] = i;
       }

The same result can be obtained directly with the function call `gsl_permutation_invert(rank,p)'.

   The following function will print the rank of each element of the vector V,

     void
     print_rank (gsl_vector * v)
     {
       size_t i;
       gsl_permutation * perm = gsl_permutation_alloc (v->size);
       gsl_permutation * rank = gsl_permutation_alloc (v->size);

       gsl_sort_vector_index (perm, v);
       gsl_permutation_invert (rank, perm);

       for (i = 0; i < v->size; i++)
         {
           double vi = gsl_vector_get (v, i);
           /* size_t values are cast to match the %lu conversion */
           printf ("element = %lu, value = % .5f, rank = %lu\n",
                   (unsigned long) i, vi, (unsigned long) rank->data[i]);
         }

       gsl_permutation_free (perm);
       gsl_permutation_free (rank);
     }

File: gsl-ref.info,  Node: Sorting Examples,  Next: Sorting References and Further Reading,  Prev: Computing the rank,  Up: Sorting

Examples
========

   The following example shows how to use the permutation P to print the elements of the vector V in ascending order,

     gsl_sort_vector_index (p, v);

     for (i = 0; i < v->size; i++)
       {
         double vpi = gsl_vector_get (v, p->data[i]);
         printf ("order = %lu, value = %g\n", (unsigned long) i, vpi);
       }

File: gsl-ref.info,  Node: Sorting References and Further Reading,  Prev: Sorting Examples,  Up: Sorting

References and Further Reading
==============================

   The subject of sorting is covered extensively in Knuth's `Sorting and Searching',

     Donald E. Knuth, `The Art of Computer Programming: Sorting and Searching' (Vol 3, 3rd Ed, 1997), Addison-Wesley, ISBN 0201896850.
The Heapsort algorithm is described in the following book,

     Robert Sedgewick, `Algorithms in C', Addison-Wesley, ISBN 0201514257.

File: gsl-ref.info,  Node: Statistics,  Next: Histograms,  Prev: Sorting,  Up: Top

Statistics
**********

   This chapter describes the statistical functions in the library. The basic statistical functions include routines to compute the mean, variance and standard deviation. More advanced functions allow you to calculate absolute deviations, skewness, and kurtosis as well as the median and arbitrary percentiles. Statistical tests for comparing different datasets, such as the t-test, are also included.

   The functions are available in versions for datasets in the standard floating-point and integer types. The versions for double precision floating-point data have the prefix `gsl_stats' and the versions for integer data have the prefix `gsl_stats_int'. The algorithms use recurrence relations to compute average quantities in a stable way, without large intermediate values that might overflow.

* Menu:

* Mean and standard deviation and variance::
* Absolute deviation::
* Higher moments (skewness and kurtosis)::
* Autocorrelation::
* Covariance::
* Weighted Samples::
* Maximum and Minimum values::
* Median and Percentiles::
* Statistical tests::
* Example statistical programs::
* Statistics References and Further Reading::

File: gsl-ref.info,  Node: Mean and standard deviation and variance,  Next: Absolute deviation,  Up: Statistics

Mean, Standard Deviation and Variance
=====================================

 - Statistics: double gsl_stats_mean (const double DATA[], size_t STRIDE, size_t N)
     This function returns the arithmetic mean of DATA, a dataset of length N with stride STRIDE. The arithmetic mean, or "sample mean", is denoted by \Hat\mu and defined as,

          \Hat\mu = (1/N) \sum x_i

     where x_i are the elements of the dataset DATA. For samples drawn from a gaussian distribution the variance of \Hat\mu is \sigma^2 / N.
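
   The chapter introduction mentions recurrence relations that avoid large intermediate sums. The following is a minimal sketch of one such recurrence for the mean, mean_k = mean_(k-1) + (x_k - mean_(k-1))/k. The name `sketch_mean' is hypothetical and the code is an illustration of the idea, not the library's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GSL: running-mean recurrence.
   After step k the variable mean holds the mean of the first k
   elements, so no large partial sum is ever formed. */
double
sketch_mean (const double data[], size_t stride, size_t n)
{
  double mean = 0.0;
  size_t i;

  assert (n > 0);               /* the mean of no elements is undefined */

  for (i = 0; i < n; i++)
    {
      /* element i of the dataset lives at data[i * stride] */
      mean += (data[i * stride] - mean) / (double) (i + 1);
    }

  return mean;
}
```

   Under these assumptions the call `sketch_mean (data, 1, 5)' on the dataset used in the example programs of this chapter gives the same sample mean as the definition above, up to rounding.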
 - Statistics: double gsl_stats_variance (const double DATA[], size_t STRIDE, size_t N)
     This function returns the estimated, or "sample", variance of DATA, a dataset of length N with stride STRIDE. The estimated variance is denoted by \Hat\sigma^2 and is defined by,

          \Hat\sigma^2 = (1/(N-1)) \sum (x_i - \Hat\mu)^2

     where x_i are the elements of the dataset DATA. Note that the normalization factor of 1/(N-1) results from the derivation of \Hat\sigma^2 as an unbiased estimator of the population variance \sigma^2. For samples drawn from a gaussian distribution the variance of \Hat\sigma^2 itself is 2 \sigma^4 / N.

     This function computes the mean via a call to `gsl_stats_mean'. If you have already computed the mean then you can pass it directly to `gsl_stats_variance_m'.

 - Statistics: double gsl_stats_variance_m (const double DATA[], size_t STRIDE, size_t N, double MEAN)
     This function returns the sample variance of DATA relative to the given value of MEAN. The variance is computed with \Hat\mu replaced by the value of MEAN that you supply,

          \Hat\sigma^2 = (1/(N-1)) \sum (x_i - mean)^2

 - Statistics: double gsl_stats_sd (const double DATA[], size_t STRIDE, size_t N)
 - Statistics: double gsl_stats_sd_m (const double DATA[], size_t STRIDE, size_t N, double MEAN)
     The standard deviation is defined as the square root of the variance. These functions return the square root of the corresponding variance functions above.

 - Statistics: double gsl_stats_variance_with_fixed_mean (const double DATA[], size_t STRIDE, size_t N, double MEAN)
     This function computes an unbiased estimate of the variance of DATA when the population mean MEAN of the underlying distribution is known _a priori_.
     In this case the estimator for the variance uses the factor 1/N and the sample mean \Hat\mu is replaced by the known population mean \mu,

          \Hat\sigma^2 = (1/N) \sum (x_i - \mu)^2

 - Statistics: double gsl_stats_sd_with_fixed_mean (const double DATA[], size_t STRIDE, size_t N, double MEAN)
     This function calculates the standard deviation of DATA for a fixed population mean MEAN. The result is the square root of the corresponding variance function.

File: gsl-ref.info,  Node: Absolute deviation,  Next: Higher moments (skewness and kurtosis),  Prev: Mean and standard deviation and variance,  Up: Statistics

Absolute deviation
==================

 - Statistics: double gsl_stats_absdev (const double DATA[], size_t STRIDE, size_t N)
     This function computes the absolute deviation from the mean of DATA, a dataset of length N with stride STRIDE. The absolute deviation from the mean is defined as,

          absdev = (1/N) \sum |x_i - \Hat\mu|

     where x_i are the elements of the dataset DATA. The absolute deviation from the mean provides a more robust measure of the width of a distribution than the variance. This function computes the mean of DATA via a call to `gsl_stats_mean'.

 - Statistics: double gsl_stats_absdev_m (const double DATA[], size_t STRIDE, size_t N, double MEAN)
     This function computes the absolute deviation of the dataset DATA relative to the given value of MEAN,

          absdev = (1/N) \sum |x_i - mean|

     This function is useful if you have already computed the mean of DATA (and want to avoid recomputing it), or wish to calculate the absolute deviation relative to another value (such as zero, or the median).

File: gsl-ref.info,  Node: Higher moments (skewness and kurtosis),  Next: Autocorrelation,  Prev: Absolute deviation,  Up: Statistics

Higher moments (skewness and kurtosis)
======================================

 - Statistics: double gsl_stats_skew (const double DATA[], size_t STRIDE, size_t N)
     This function computes the skewness of DATA, a dataset of length N with stride STRIDE.
     The skewness is defined as,

          skew = (1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^3

     where x_i are the elements of the dataset DATA. The skewness measures the asymmetry of the tails of a distribution.

     The function computes the mean and estimated standard deviation of DATA via calls to `gsl_stats_mean' and `gsl_stats_sd'.

 - Statistics: double gsl_stats_skew_m_sd (const double DATA[], size_t STRIDE, size_t N, double MEAN, double SD)
     This function computes the skewness of the dataset DATA using the given values of the mean MEAN and standard deviation SD,

          skew = (1/N) \sum ((x_i - mean)/sd)^3

     This function is useful if you have already computed the mean and standard deviation of DATA and want to avoid recomputing them.

 - Statistics: double gsl_stats_kurtosis (const double DATA[], size_t STRIDE, size_t N)
     This function computes the kurtosis of DATA, a dataset of length N with stride STRIDE. The kurtosis is defined as,

          kurtosis = ((1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^4) - 3

     The kurtosis measures how sharply peaked a distribution is, relative to its width. The kurtosis is normalized to zero for a gaussian distribution.

 - Statistics: double gsl_stats_kurtosis_m_sd (const double DATA[], size_t STRIDE, size_t N, double MEAN, double SD)
     This function computes the kurtosis of the dataset DATA using the given values of the mean MEAN and standard deviation SD,

          kurtosis = ((1/N) \sum ((x_i - mean)/sd)^4) - 3

     This function is useful if you have already computed the mean and standard deviation of DATA and want to avoid recomputing them.

File: gsl-ref.info,  Node: Autocorrelation,  Next: Covariance,  Prev: Higher moments (skewness and kurtosis),  Up: Statistics

Autocorrelation
===============

 - Function: double gsl_stats_lag1_autocorrelation (const double DATA[], const size_t STRIDE, const size_t N)
     This function computes the lag-1 autocorrelation of the dataset DATA.
          a_1 = (\sum_{i = 1}^{n} (x_i - \Hat\mu) (x_{i-1} - \Hat\mu)) / (\sum_{i = 1}^{n} (x_i - \Hat\mu) (x_i - \Hat\mu))

 - Function: double gsl_stats_lag1_autocorrelation_m (const double DATA[], const size_t STRIDE, const size_t N, const double MEAN)
     This function computes the lag-1 autocorrelation of the dataset DATA using the given value of the mean MEAN.

File: gsl-ref.info,  Node: Covariance,  Next: Weighted Samples,  Prev: Autocorrelation,  Up: Statistics

Covariance
==========

 - Function: double gsl_stats_covariance (const double DATA1[], const size_t STRIDE1, const double DATA2[], const size_t STRIDE2, const size_t N)
     This function computes the covariance of the datasets DATA1 and DATA2 which must both be of the same length N.

          covar = (1/(n - 1)) \sum_{i = 1}^{n} (x_i - \Hat x) (y_i - \Hat y)

 - Function: double gsl_stats_covariance_m (const double DATA1[], const size_t STRIDE1, const double DATA2[], const size_t STRIDE2, const size_t N, const double MEAN1, const double MEAN2)
     This function computes the covariance of the datasets DATA1 and DATA2 using the given values of the means MEAN1 and MEAN2.

File: gsl-ref.info,  Node: Weighted Samples,  Next: Maximum and Minimum values,  Prev: Covariance,  Up: Statistics

Weighted Samples
================

   The functions described in this section allow the computation of statistics for weighted samples. The functions accept an array of samples, x_i, with associated weights, w_i. Each sample x_i is considered as having been drawn from a Gaussian distribution with variance \sigma_i^2. The sample weight w_i is defined as the reciprocal of this variance, w_i = 1/\sigma_i^2. Setting a weight to zero corresponds to removing a sample from a dataset.

 - Statistics: double gsl_stats_wmean (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N)
     This function returns the weighted mean of the dataset DATA with stride STRIDE and length N, using the set of weights W with stride WSTRIDE and length N.
     The weighted mean is defined as,

          \Hat\mu = (\sum w_i x_i) / (\sum w_i)

 - Statistics: double gsl_stats_wvariance (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N)
     This function returns the estimated variance of the dataset DATA with stride STRIDE and length N, using the set of weights W with stride WSTRIDE and length N. The estimated variance of a weighted dataset is defined as,

          \Hat\sigma^2 = ((\sum w_i)/((\sum w_i)^2 - \sum (w_i^2))) \sum w_i (x_i - \Hat\mu)^2

     Note that this expression reduces to an unweighted variance with the familiar 1/(N-1) factor when there are N equal non-zero weights.

 - Statistics: double gsl_stats_wvariance_m (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N, double WMEAN)
     This function returns the estimated variance of the weighted dataset DATA using the given weighted mean WMEAN.

 - Statistics: double gsl_stats_wsd (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N)
     The standard deviation is defined as the square root of the variance. This function returns the square root of the corresponding variance function above.

 - Statistics: double gsl_stats_wsd_m (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N, double WMEAN)
     This function returns the square root of the corresponding variance function `gsl_stats_wvariance_m' above.

 - Statistics: double gsl_stats_wvariance_with_fixed_mean (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N, const double MEAN)
     This function computes an unbiased estimate of the variance of the weighted dataset DATA when the population mean MEAN of the underlying distribution is known _a priori_. In this case the estimator for the variance replaces the sample mean \Hat\mu by the known population mean \mu,

          \Hat\sigma^2 = (\sum w_i (x_i - \mu)^2) / (\sum w_i)

 - Statistics: double gsl_stats_wsd_with_fixed_mean (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N, const double MEAN)
     The standard deviation is defined as the square root of the variance. This function returns the square root of the corresponding variance function above.
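
   The weighted mean and weighted variance formulas above can be written out directly. The following is a minimal sketch that transcribes the two definitions as plain sums. The names `sketch_wmean' and `sketch_wvariance' are hypothetical, and the code does not use the recurrence relations that the library itself is described as using.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GSL:
   mu = (sum w_i x_i) / (sum w_i) */
double
sketch_wmean (const double w[], size_t wstride,
              const double data[], size_t stride, size_t n)
{
  double sum_w = 0.0, sum_wx = 0.0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      const double wi = w[i * wstride];
      sum_w += wi;
      sum_wx += wi * data[i * stride];
    }

  assert (sum_w > 0.0);         /* at least one non-zero weight */
  return sum_wx / sum_w;
}

/* Hypothetical helper, not part of GSL:
   sigma^2 = ((sum w_i) / ((sum w_i)^2 - sum w_i^2))
             * sum w_i (x_i - mu)^2 */
double
sketch_wvariance (const double w[], size_t wstride,
                  const double data[], size_t stride, size_t n)
{
  const double mu = sketch_wmean (w, wstride, data, stride, n);
  double sum_w = 0.0, sum_w2 = 0.0, sum_wd2 = 0.0, denom;
  size_t i;

  for (i = 0; i < n; i++)
    {
      const double wi = w[i * wstride];
      const double d = data[i * stride] - mu;
      sum_w += wi;
      sum_w2 += wi * wi;
      sum_wd2 += wi * d * d;
    }

  denom = sum_w * sum_w - sum_w2;
  assert (denom > 0.0);         /* needs two or more non-zero weights */
  return (sum_w / denom) * sum_wd2;
}
```

   With N equal weights w the prefactor becomes (N w)/(N^2 w^2 - N w^2) = 1/((N-1) w), and the factor w in the sum cancels it, which recovers the unweighted 1/(N-1) estimator noted above.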
 - Statistics: double gsl_stats_wabsdev (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N)
     This function computes the weighted absolute deviation from the weighted mean of DATA. The absolute deviation from the mean is defined as,

          absdev = (\sum w_i |x_i - \Hat\mu|) / (\sum w_i)

 - Statistics: double gsl_stats_wabsdev_m (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N, double WMEAN)
     This function computes the absolute deviation of the weighted dataset DATA about the given weighted mean WMEAN.

 - Statistics: double gsl_stats_wskew (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N)
     This function computes the weighted skewness of the dataset DATA.

          skew = (\sum w_i ((x_i - \Hat\mu)/\Hat\sigma)^3) / (\sum w_i)

 - Statistics: double gsl_stats_wskew_m_sd (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N, double WMEAN, double WSD)
     This function computes the weighted skewness of the dataset DATA using the given values of the weighted mean WMEAN and weighted standard deviation WSD.

 - Statistics: double gsl_stats_wkurtosis (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N)
     This function computes the weighted kurtosis of the dataset DATA.

          kurtosis = ((\sum w_i ((x_i - \Hat\mu)/\Hat\sigma)^4) / (\sum w_i)) - 3

 - Statistics: double gsl_stats_wkurtosis_m_sd (const double W[], size_t WSTRIDE, const double DATA[], size_t STRIDE, size_t N, double WMEAN, double WSD)
     This function computes the weighted kurtosis of the dataset DATA using the given values of the weighted mean WMEAN and weighted standard deviation WSD.

File: gsl-ref.info,  Node: Maximum and Minimum values,  Next: Median and Percentiles,  Prev: Weighted Samples,  Up: Statistics

Maximum and Minimum values
==========================

 - Statistics: double gsl_stats_max (const double DATA[], size_t STRIDE, size_t N)
     This function returns the maximum value in DATA, a dataset of length N with stride STRIDE. The maximum value is defined as the value of the element x_i which satisfies x_i >= x_j for all j.

     If you want instead to find the element with the largest absolute magnitude you will need to apply `fabs' or `abs' to your data before calling this function.

 - Statistics: double gsl_stats_min (const double DATA[], size_t STRIDE, size_t N)
     This function returns the minimum value in DATA, a dataset of length N with stride STRIDE. The minimum value is defined as the value of the element x_i which satisfies x_i <= x_j for all j.
     If you want instead to find the element with the smallest absolute magnitude you will need to apply `fabs' or `abs' to your data before calling this function.

 - Statistics: void gsl_stats_minmax (double * MIN, double * MAX, const double DATA[], size_t STRIDE, size_t N)
     This function finds both the minimum and maximum values MIN, MAX in DATA in a single pass.

 - Statistics: size_t gsl_stats_max_index (const double DATA[], size_t STRIDE, size_t N)
     This function returns the index of the maximum value in DATA, a dataset of length N with stride STRIDE. The maximum value is defined as the value of the element x_i which satisfies x_i >= x_j for all j. When there are several equal maximum elements then the first one is chosen.

 - Statistics: size_t gsl_stats_min_index (const double DATA[], size_t STRIDE, size_t N)
     This function returns the index of the minimum value in DATA, a dataset of length N with stride STRIDE. The minimum value is defined as the value of the element x_i which satisfies x_i <= x_j for all j. When there are several equal minimum elements then the first one is chosen.

 - Statistics: void gsl_stats_minmax_index (size_t * MIN_INDEX, size_t * MAX_INDEX, const double DATA[], size_t STRIDE, size_t N)
     This function returns the indexes MIN_INDEX, MAX_INDEX of the minimum and maximum values in DATA in a single pass.

File: gsl-ref.info,  Node: Median and Percentiles,  Next: Statistical tests,  Prev: Maximum and Minimum values,  Up: Statistics

Median and Percentiles
======================

   The median and percentile functions described in this section operate on sorted data. For convenience we use "quantiles", measured on a scale of 0 to 1, instead of percentiles (which use a scale of 0 to 100).

 - Statistics: double gsl_stats_median_from_sorted_data (const double SORTED_DATA[], size_t STRIDE, size_t N)
     This function returns the median value of SORTED_DATA, a dataset of length N with stride STRIDE. The elements of the array must be in ascending numerical order.
     There are no checks to see whether the data are sorted, so the function `gsl_sort' should always be used first.

     When the dataset has an odd number of elements the median is the value of element (n-1)/2. When the dataset has an even number of elements the median is the mean of the two nearest middle values, elements (n-1)/2 and n/2. Since the algorithm for computing the median involves interpolation this function always returns a floating-point number, even for integer data types.

 - Statistics: double gsl_stats_quantile_from_sorted_data (const double SORTED_DATA[], size_t STRIDE, size_t N, double F)
     This function returns a quantile value of SORTED_DATA, a double-precision array of length N with stride STRIDE. The elements of the array must be in ascending numerical order. The quantile is determined by F, a fraction between 0 and 1. For example, to compute the value of the 75th percentile F should have the value 0.75.

     There are no checks to see whether the data are sorted, so the function `gsl_sort' should always be used first.

     The quantile is found by interpolation, using the formula

          quantile = (1 - \delta) x_i + \delta x_{i+1}

     where i is `floor'((n - 1)f) and \delta is (n-1)f - i.

     Thus the minimum value of the array (`data[0*stride]') is given by F equal to zero, the maximum value (`data[(n-1)*stride]') is given by F equal to one and the median value is given by F equal to 0.5. Since the algorithm for computing quantiles involves interpolation this function always returns a floating-point number, even for integer data types.
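
   The interpolation formula can be followed step by step in code. The following is a minimal sketch, assuming sorted input and 0 <= F <= 1. The name `sketch_quantile' is hypothetical and the code is an illustration of the formula rather than the library source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GSL:
   quantile = (1 - delta) x_i + delta x_{i+1}
   with i = floor((n - 1) f) and delta = (n - 1) f - i. */
double
sketch_quantile (const double sorted[], size_t stride, size_t n, double f)
{
  double index, delta;
  size_t i;

  assert (n > 0);
  assert (f >= 0.0 && f <= 1.0);

  index = f * (double) (n - 1);
  i = (size_t) index;           /* truncation equals floor for index >= 0 */
  delta = index - (double) i;

  if (i == n - 1)
    {
      /* f == 1: there is no element i+1, so return the maximum */
      return sorted[i * stride];
    }

  return (1.0 - delta) * sorted[i * stride]
    + delta * sorted[(i + 1) * stride];
}
```

   For a dataset with an even number of elements, F equal to 0.5 gives \delta = 0.5, so the sketch returns the mean of the two middle values, in agreement with the description of the median above.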
File: gsl-ref.info,  Node: Statistical tests,  Next: Example statistical programs,  Prev: Median and Percentiles,  Up: Statistics

Statistical tests
=================

   FIXME, do more work on the statistical tests

File: gsl-ref.info,  Node: Example statistical programs,  Next: Statistics References and Further Reading,  Prev: Statistical tests,  Up: Statistics

Example statistical programs
============================

   Here is a basic example of how to use the statistical functions:

     #include <stdio.h>
     #include <gsl/gsl_statistics.h>

     int
     main (void)
     {
       double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6};
       double mean, variance, largest, smallest;

       mean     = gsl_stats_mean (data, 1, 5);
       variance = gsl_stats_variance (data, 1, 5);
       largest  = gsl_stats_max (data, 1, 5);
       smallest = gsl_stats_min (data, 1, 5);

       printf ("The dataset is %g, %g, %g, %g, %g\n",
               data[0], data[1], data[2], data[3], data[4]);

       printf ("The sample mean is %g\n", mean);
       printf ("The estimated variance is %g\n", variance);
       printf ("The largest value is %g\n", largest);
       printf ("The smallest value is %g\n", smallest);
       return 0;
     }

   The program should produce the following output. The variance shown is the value given by the 1/(N-1) definition of `gsl_stats_variance': the squared deviations from the mean sum to 21.492, and 21.492/4 = 5.373.

     The dataset is 17.2, 18.1, 16.5, 18.3, 12.6
     The sample mean is 16.54
     The estimated variance is 5.373
     The largest value is 18.3
     The smallest value is 12.6

   Here is an example using sorted data,

     #include <stdio.h>
     #include <gsl/gsl_sort.h>
     #include <gsl/gsl_statistics.h>

     int
     main (void)
     {
       double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6};
       double median, upperq, lowerq;

       printf ("The original dataset is %g, %g, %g, %g, %g\n",
               data[0], data[1], data[2], data[3], data[4]);

       gsl_sort (data, 1, 5);

       printf ("The sorted dataset is %g, %g, %g, %g, %g\n",
               data[0], data[1], data[2], data[3], data[4]);

       median = gsl_stats_median_from_sorted_data (data, 1, 5);
       upperq = gsl_stats_quantile_from_sorted_data (data, 1, 5, 0.75);
       lowerq = gsl_stats_quantile_from_sorted_data (data, 1, 5, 0.25);

       printf ("The median is %g\n", median);
       printf ("The upper quartile is %g\n", upperq);
       printf ("The lower quartile is %g\n", lowerq);
       return 0;
     }

   This program should produce the following output,

     The original dataset is 17.2, 18.1, 16.5, 18.3, 12.6
     The sorted dataset is 12.6, 16.5, 17.2, 18.1, 18.3
     The median is 17.2
     The upper quartile is 18.1
     The lower quartile is 16.5

File: gsl-ref.info,  Node: Statistics References and Further Reading,  Prev: Example statistical programs,  Up: Statistics

References and Further Reading
==============================

   The standard reference for almost any topic in statistics is the multi-volume `Advanced Theory of Statistics' by Kendall and Stuart.

     Maurice Kendall, Alan Stuart, and J. Keith Ord. `The Advanced Theory of Statistics' (multiple volumes) reprinted as `Kendall's Advanced Theory of Statistics'. Wiley, ISBN 047023380X.

Many statistical concepts can be more easily understood by a Bayesian approach. The following book by Gelman, Carlin, Stern and Rubin gives comprehensive coverage of the subject.

     Andrew Gelman, John B. Carlin, Hal S. Stern, Donald B. Rubin. `Bayesian Data Analysis'. Chapman & Hall, ISBN 0412039915.

For physicists the Particle Data Group provides useful reviews of Probability and Statistics in the "Mathematical Tools" section of its Annual Review of Particle Physics.

     `Review of Particle Properties' R.M. Barnett et al., Physical Review D54, 1 (1996)

The Review of Particle Physics is available online at <http://pdg.lbl.gov/>.

File: gsl-ref.info,  Node: Histograms,  Next: One dimensional Root-Finding,  Prev: Statistics,  Up: Top

Histograms
**********

   This chapter describes functions for creating histograms. Histograms provide a convenient way of summarizing the distribution of a set of data. A histogram consists of a set of "bins" which count the number of events falling into a given range of a continuous variable x. In GSL the bins of a histogram contain floating-point numbers, so they can be used to record both integer and non-integer distributions.
The bins can use arbitrary sets of ranges (uniformly spaced bins are the default). Both one and two-dimensional histograms are supported.

   Once a histogram has been created it can also be converted into a probability distribution function. The library provides efficient routines for selecting random samples from probability distributions. This can be useful for generating simulations based on real data.

* Menu:

* The histogram struct::
* Histogram allocation::
* Copying Histograms::
* Updating and accessing histogram elements::
* Searching histogram ranges::
* Histogram Statistics::
* Histogram Operations::
* Reading and writing histograms::
* Resampling from histograms::
* The histogram probability distribution struct::
* Example programs for histograms::
* Two dimensional histograms::
* The 2D histogram struct::
* 2D Histogram allocation::
* Copying 2D Histograms::
* Updating and accessing 2D histogram elements::
* Searching 2D histogram ranges::
* 2D Histogram Statistics::
* 2D Histogram Operations::
* Reading and writing 2D histograms::
* Resampling from 2D histograms::
* Example programs for 2D histograms::

File: gsl-ref.info,  Node: The histogram struct,  Next: Histogram allocation,  Up: Histograms

The histogram struct
====================

   A histogram is defined by the following struct,

 - Data Type: gsl_histogram
    `size_t n'
          This is the number of histogram bins.

    `double * range'
          The ranges of the bins are stored in an array of N+1 elements pointed to by RANGE.

    `double * bin'
          The counts for each bin are stored in an array of N elements pointed to by BIN. The bins are floating-point numbers, so you can increment them by non-integer values if necessary.

The range for BIN[i] is given by RANGE[i] to RANGE[i+1]. For n bins there are n+1 entries in the array RANGE. Each bin is inclusive at the lower end and exclusive at the upper end.
Mathematically this means that the bins are defined by the following inequality,

     bin[i] corresponds to range[i] <= x < range[i+1]

Here is a diagram of the correspondence between ranges and bins on the number-line for x,

       r[0]      r[1]      r[2]      r[3]      r[4]      r[5]
     ---|---------|---------|---------|---------|---------|---  x
        [ bin[0] )[ bin[1] )[ bin[2] )[ bin[3] )[ bin[4] )

In this picture the values of the RANGE array are denoted by r. On the left-hand side of each bin the square bracket "`['" denotes an inclusive lower bound (r <= x), and the round parenthesis "`)'" on the right-hand side denotes an exclusive upper bound (x < r). Thus any samples which fall on the upper end of the histogram are excluded. If you want to include this value for the last bin you will need to add an extra bin to your histogram.

   The `gsl_histogram' struct and its associated functions are defined in the header file `gsl_histogram.h'.

File: gsl-ref.info,  Node: Histogram allocation,  Next: Copying Histograms,  Prev: The histogram struct,  Up: Histograms

Histogram allocation
====================

   The functions for allocating memory to a histogram follow the style of `malloc' and `free'. In addition they also perform their own error checking. If there is insufficient memory available to allocate a histogram then the functions call the GSL error handler (with an error number of `GSL_ENOMEM') in addition to returning a null pointer. Thus if you use the library error handler to abort your program then it isn't necessary to check every histogram `alloc'.

 - Function: gsl_histogram * gsl_histogram_calloc (size_t N)
     This function allocates memory for a histogram with N bins, and returns a pointer to its newly initialized `gsl_histogram' struct. The bins are uniformly spaced with a total range of 0 <= x < n, as shown in the table below.

          bin[0] corresponds to 0 <= x < 1
          bin[1] corresponds to 1 <= x < 2
          ......
          bin[n-1] corresponds to n-1 <= x < n

     The bins are initialized to zero so the histogram is ready for use.
     If insufficient memory is available a null pointer is returned and
     the error handler is invoked with an error code of `GSL_ENOMEM'.

 - Function: gsl_histogram * gsl_histogram_calloc_uniform (size_t N,
          double XMIN, double XMAX)
     This function allocates memory for a histogram with N uniformly
     spaced bins from XMIN to XMAX, and returns a pointer to the newly
     initialized `gsl_histogram' struct.  The bins are shown in the
     table below,

          bin[0]   corresponds to xmin          <= x < xmin + d
          bin[1]   corresponds to xmin + d      <= x < xmin + 2 d
          ......
          bin[n-1] corresponds to xmin + (n-1)d <= x < xmax

     where d is the bin spacing, (xmax-xmin)/n.  Each bin is
     initialized to zero.  If insufficient memory is available a null
     pointer is returned and the error handler is invoked with an error
     code of `GSL_ENOMEM'.

To create a histogram with non-uniform bins you will need to call
`gsl_histogram_calloc' to prepare a new histogram struct and then modify
the `range' array to use your desired bin limits.  The ranges can be
arbitrary, subject to the restriction that they are monotonically
increasing.  For example, the following code fragment shows how to
create a histogram with logarithmic bins covering 1 to 10, 10 to 100
and 100 to 1000.

     gsl_histogram * h = gsl_histogram_calloc (3);

     h->range[0] = 1.0;
     /* bin[0] covers the range 1 <= x < 10 */
     h->range[1] = 10.0;
     /* bin[1] covers the range 10 <= x < 100 */
     h->range[2] = 100.0;
     /* bin[2] covers the range 100 <= x < 1000 */
     h->range[3] = 1000.0;

Note that the size of the RANGE array is automatically defined as
`double range[4]' by `gsl_histogram_calloc', and is one element bigger
than the array of bins `double bin[3]'.  Thus the range array safely
includes extra space for the final upper value, RANGE[3].

 - Function: gsl_histogram * gsl_histogram_calloc_range (size_t N,
          double * RANGE)
     This function allocates a histogram of size N using the n+1 bin
     ranges specified by the array RANGE.
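As an illustration of the monotonicity requirement on bin ranges, the
following stand-alone sketch builds a geometrically spaced array of n+1
bin edges and checks that it is strictly increasing.  The helper names
`fill_geometric_ranges' and `ranges_are_increasing' are hypothetical
and are not part of the library; the sketch only prepares and validates
a plain `double' array of the kind described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a GSL function: fill range[0..n] with n+1
   bin edges, starting at xmin and multiplying by factor at each step.
   With xmin = 1 and factor = 10 this gives 1, 10, 100, 1000, ... */
void
fill_geometric_ranges (double *range, size_t n, double xmin, double factor)
{
  size_t i;

  assert (n > 0 && xmin > 0.0 && factor > 1.0);

  range[0] = xmin;
  for (i = 1; i <= n; i++)
    range[i] = range[i - 1] * factor;
}

/* Hypothetical helper, not a GSL function: return 1 if the n+1 edges
   in range[0..n] are strictly increasing, and 0 otherwise. */
int
ranges_are_increasing (const double *range, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    if (!(range[i] < range[i + 1]))
      return 0;

  return 1;
}
```

Calling `fill_geometric_ranges (r, 3, 1.0, 10.0)' on an array
`double r[4]' produces the four edges used in the logarithmic example
above.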
 - Function: void gsl_histogram_free (gsl_histogram * H)
     This function frees the histogram H and all of the memory
     associated with it.

File: gsl-ref.info, Node: Copying Histograms, Next: Updating and accessing histogram elements, Prev: Histogram allocation, Up: Histograms

Copying Histograms
==================

 - Function: int gsl_histogram_memcpy (gsl_histogram * DEST, const
          gsl_histogram * SRC)
     This function copies the histogram SRC into the pre-existing
     histogram DEST, making DEST into an exact copy of SRC.  The two
     histograms must be of the same size.

 - Function: gsl_histogram * gsl_histogram_clone (const gsl_histogram *
          SRC)
     This function returns a pointer to a newly created histogram which
     is an exact copy of the histogram SRC.

File: gsl-ref.info, Node: Updating and accessing histogram elements, Next: Searching histogram ranges, Prev: Copying Histograms, Up: Histograms

Updating and accessing histogram elements
=========================================

There are two ways to access histogram bins, either by specifying an x
coordinate or by using the bin-index directly.  The functions for
accessing the histogram through x coordinates use a binary search to
identify the bin which covers the appropriate range.

 - Function: int gsl_histogram_increment (gsl_histogram * H, double X)
     This function updates the histogram H by adding one (1.0) to the
     bin whose range contains the coordinate X.

     If X lies in the valid range of the histogram then the function
     returns zero to indicate success.  If X is less than the lower
     limit of the histogram then the function returns `GSL_EDOM', and
     none of the bins are modified.  Similarly, if the value of X is
     greater than or equal to the upper limit of the histogram then the
     function returns `GSL_EDOM', and none of the bins are modified.
     The error handler is not called, however, since it is often
     necessary to compute a histogram for a small range of a larger
     dataset, ignoring the values outside the range of interest.
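The increment-by-coordinate behaviour described above can be sketched
without the library as a binary search over the range array followed by
an update of the matching bin.  This is only an illustration under
stated assumptions, not the library's implementation: the names
`sketch_find', `sketch_increment' and `SKETCH_EDOM' are hypothetical,
and the search is a hand-written loop rather than a call to `bsearch'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an out-of-range status such as GSL_EDOM. */
#define SKETCH_EDOM 1

/* Find i such that range[i] <= x < range[i+1], where range holds n+1
   increasing edges.  Returns 0 and sets *i on success, or SKETCH_EDOM
   (leaving *i untouched) if x is outside [range[0], range[n]). */
int
sketch_find (size_t n, const double range[], double x, size_t *i)
{
  size_t lo = 0, hi = n;

  assert (n > 0);

  /* Written as a negated test so that a NaN coordinate is rejected. */
  if (!(x >= range[0] && x < range[n]))
    return SKETCH_EDOM;

  /* Invariant: range[lo] <= x < range[hi]. */
  while (hi - lo > 1)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (x >= range[mid])
        lo = mid;
      else
        hi = mid;
    }

  *i = lo;
  return 0;
}

/* Add 1.0 to the bin containing x.  Out-of-range samples modify no
   bins and are reported through the return value only. */
int
sketch_increment (size_t n, const double range[], double bin[], double x)
{
  size_t i;
  int status = sketch_find (n, range, x, &i);

  if (status != 0)
    return status;

  bin[i] += 1.0;
  return 0;
}
```

With the edges 1, 10, 100, 1000 a sample at x = 10 lands in bin[1],
because the lower edge is inclusive, while a sample at x = 1000 is
rejected, because the upper edge of the histogram is exclusive.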
 - Function: int gsl_histogram_accumulate (gsl_histogram * H, double X,
          double WEIGHT)
     This function is similar to `gsl_histogram_increment' but
     increases the value of the appropriate bin in the histogram H by
     the floating-point number WEIGHT.

 - Function: double gsl_histogram_get (const gsl_histogram * H, size_t
          I)
     This function returns the contents of the Ith bin of the histogram
     H.  If I lies outside the valid range of indices for the histogram
     then the error handler is called with an error code of `GSL_EDOM'
     and the function returns 0.

 - Function: int gsl_histogram_get_range (const gsl_histogram * H,
          size_t I, double * LOWER, double * UPPER)
     This function finds the upper and lower range limits of the Ith
     bin of the histogram H.  If the index I is valid then the
     corresponding range limits are stored in LOWER and UPPER.  The
     lower limit is inclusive (i.e. events with this coordinate are
     included in the bin) and the upper limit is exclusive (i.e. events
     with the coordinate of the upper limit are excluded and fall in
     the neighboring higher bin, if it exists).  The function returns 0
     to indicate success.  If I lies outside the valid range of indices
     for the histogram then the error handler is called and the
     function returns an error code of `GSL_EDOM'.

 - Function: double gsl_histogram_max (const gsl_histogram * H)
 - Function: double gsl_histogram_min (const gsl_histogram * H)
 - Function: size_t gsl_histogram_bins (const gsl_histogram * H)
     These functions return the maximum upper and minimum lower range
     limits and the number of bins of the histogram H.  They provide a
     way of determining these values without accessing the
     `gsl_histogram' struct directly.

 - Function: void gsl_histogram_reset (gsl_histogram * H)
     This function resets all the bins in the histogram H to zero.
File: gsl-ref.info, Node: Searching histogram ranges, Next: Histogram Statistics, Prev: Updating and accessing histogram elements, Up: Histograms

Searching histogram ranges
==========================

The following functions are used by the access and update routines to
locate the bin which corresponds to a given x coordinate.

 - Function: int gsl_histogram_find_impl (size_t N, const double
          RANGE[], double X, size_t * I)
     This function finds and sets the index I to the offset in the
     array RANGE of size N which bounds the value of X, such that
     range[i] <= x < range[i+1].  The binary search function `bsearch'
     from the system C-library is used to locate the appropriate range.
     If a suitable value of I is found then the function returns 0 to
     indicate success.  If X is less than the lower limit the function
     returns -1, and if X is greater than or equal to the upper limit
     it returns +1.  The error handler is not called.

 - Function: int gsl_histogram_find (const gsl_histogram * H, double X,
          size_t * I)
     This function uses `gsl_histogram_find_impl' to set the index I to
     the bin number which covers the coordinate X in the histogram H.
     If X is found then the function sets the index I and returns zero
     to indicate success.  If X lies outside the valid range of the
     histogram then the function returns `GSL_EDOM' and the error
     handler is invoked.

File: gsl-ref.info, Node: Histogram Statistics, Next: Histogram Operations, Prev: Searching histogram ranges, Up: Histograms

Histogram Statistics
====================

 - Function: double gsl_histogram_max_val (const gsl_histogram * H)
     This function returns the maximum value contained in the histogram
     bins.

 - Function: size_t gsl_histogram_max_bin (const gsl_histogram * H)
     This function returns the index of the bin containing the maximum
     value.  In the case where several bins contain the same maximum
     value the smallest index is returned.
 - Function: double gsl_histogram_min_val (const gsl_histogram * H)
     This function returns the minimum value contained in the histogram
     bins.

 - Function: size_t gsl_histogram_min_bin (const gsl_histogram * H)
     This function returns the index of the bin containing the minimum
     value.  In the case where several bins contain the same minimum
     value the smallest index is returned.

 - Function: double gsl_histogram_mean (const gsl_histogram * H)
     This function returns the mean of the histogrammed variable, where
     the histogram is regarded as a probability distribution.  Negative
     bin values are ignored for the purposes of this calculation.

 - Function: double gsl_histogram_sigma (const gsl_histogram * H)
     This function returns the standard deviation of the histogrammed
     variable, where the histogram is regarded as a probability
     distribution.  Negative bin values are ignored for the purposes of
     this calculation.

File: gsl-ref.info, Node: Histogram Operations, Next: Reading and writing histograms, Prev: Histogram Statistics, Up: Histograms

Histogram Operations
====================

 - Function: int gsl_histogram_equal_bins_p (const gsl_histogram *H1,
          const gsl_histogram *H2)
     This function returns 1 if all of the individual bin ranges of the
     two histograms are identical, and 0 otherwise.

 - Function: int gsl_histogram_add (gsl_histogram *H1, const
          gsl_histogram *H2)
     This function adds the contents of the bins in histogram H2 to the
     corresponding bins of histogram H1, i.e. h'_1(i) = h_1(i) + h_2(i).

 - Function: int gsl_histogram_sub (gsl_histogram *H1, const
          gsl_histogram *H2)
     This function subtracts the contents of the bins in histogram H2
     from the corresponding bins of histogram H1, i.e. h'_1(i) = h_1(i)
     - h_2(i).

 - Function: int gsl_histogram_mul (gsl_histogram *H1, const
          gsl_histogram *H2)
     This function multiplies the contents of the bins of histogram H1
     by the contents of the corresponding bins in histogram H2, i.e.
     h'_1(i) = h_1(i) * h_2(i).
 - Function: int gsl_histogram_div (gsl_histogram *H1, const
          gsl_histogram *H2)
     This function divides the contents of the bins of histogram H1 by
     the contents of the corresponding bins in histogram H2, i.e.
     h'_1(i) = h_1(i) / h_2(i).

 - Function: int gsl_histogram_scale (gsl_histogram *H, double SCALE)
     This function multiplies the contents of the bins of histogram H
     by the constant SCALE, i.e. h'_1(i) = h_1(i) * scale.

 - Function: int gsl_histogram_shift (gsl_histogram *H, double OFFSET)
     This function shifts the contents of the bins of histogram H by
     the constant OFFSET, i.e. h'_1(i) = h_1(i) + offset.

File: gsl-ref.info, Node: Reading and writing histograms, Next: Resampling from histograms, Prev: Histogram Operations, Up: Histograms

Reading and writing histograms
==============================

The library provides functions for reading and writing histograms to a
file as binary data or formatted text.

 - Function: int gsl_histogram_fwrite (FILE * STREAM, const
          gsl_histogram * H)
     This function writes the ranges and bins of the histogram H to the
     stream STREAM in binary format.  The return value is 0 for success
     and `GSL_EFAILED' if there was a problem writing to the file.
     Since the data is written in the native binary format it may not
     be portable between different architectures.

 - Function: int gsl_histogram_fread (FILE * STREAM, gsl_histogram * H)
     This function reads into the histogram H from the open stream
     STREAM in binary format.  The histogram H must be preallocated
     with the correct size since the function uses the number of bins
     in H to determine how many bytes to read.  The return value is 0
     for success and `GSL_EFAILED' if there was a problem reading from
     the file.  The data is assumed to have been written in the native
     binary format on the same architecture.
 - Function: int gsl_histogram_fprintf (FILE * STREAM, const
          gsl_histogram * H, const char * RANGE_FORMAT, const char *
          BIN_FORMAT)
     This function writes the ranges and bins of the histogram H
     line-by-line to the stream STREAM using the format specifiers
     RANGE_FORMAT and BIN_FORMAT.  These should be one of the `%g',
     `%e' or `%f' formats for floating point numbers.  The function
     returns 0 for success and `GSL_EFAILED' if there was a problem
     writing to the file.  The histogram output is formatted in three
     columns, and the columns are separated by spaces, like this,

          range[0]   range[1] bin[0]
          range[1]   range[2] bin[1]
          range[2]   range[3] bin[2]
          ....
          range[n-1] range[n] bin[n-1]

     The values of the ranges are formatted using RANGE_FORMAT and the
     values of the bins are formatted using BIN_FORMAT.  Each line
     contains the lower and upper limit of the range of the bin and
     the value of the bin itself.  Since the upper limit of one bin is
     the lower limit of the next there is duplication of these values
     between lines, but this allows the histogram to be manipulated
     with line-oriented tools.

 - Function: int gsl_histogram_fscanf (FILE * STREAM, gsl_histogram * H)
     This function reads formatted data from the stream STREAM into the
     histogram H.  The data is assumed to be in the three-column format
     used by `gsl_histogram_fprintf'.  The histogram H must be
     preallocated with the correct length since the function uses the
     size of H to determine how many numbers to read.  The function
     returns 0 for success and `GSL_EFAILED' if there was a problem
     reading from the file.

File: gsl-ref.info, Node: Resampling from histograms, Next: The histogram probability distribution struct, Prev: Reading and writing histograms, Up: Histograms

Resampling from histograms
==========================

A histogram made by counting events can be regarded as a measurement of
a probability distribution.
Allowing for statistical error, the height of each bin represents the
probability of an event where the value of x falls in the range of that
bin.  The probability distribution function has the one-dimensional
form p(x)dx where,

     p(x) = n_i / (N w_i)

In this equation n_i is the number of events in the bin which contains
x, w_i is the width of the bin and N is the total number of events.
The distribution of events within each bin is assumed to be uniform.
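The formula p(x) = n_i / (N w_i) can be checked with a short
stand-alone sketch that works on plain arrays laid out like the `range'
and `bin' members of the histogram struct.  The function name
`sketch_bin_density' is hypothetical and not part of the library, and
returning zero for an empty histogram is an assumption made here for
convenience.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a GSL function: return the density
   p = n_i / (N w_i) for bin i, where n_i = bin[i], N is the sum of all
   n bins and w_i = range[i+1] - range[i].  Returns 0.0 if the
   histogram holds no events (an assumption of this sketch). */
double
sketch_bin_density (size_t n, const double range[], const double bin[],
                    size_t i)
{
  double total = 0.0;
  size_t k;

  assert (i < n);

  for (k = 0; k < n; k++)
    total += bin[k];

  if (total <= 0.0)
    return 0.0;

  return bin[i] / (total * (range[i + 1] - range[i]));
}
```

For example, with edges 0, 1, 3 and counts 2 and 6 the total is N = 8,
so the densities are 2/(8*1) = 0.25 and 6/(8*2) = 0.375.  Multiplying
each density by its bin width and summing gives 0.25 + 0.75 = 1, as
expected for a normalized distribution.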